Machine Learning Augmented Interpretation of Chest X-rays: A Systematic Review

Limitations of the chest X-ray (CXR) have resulted in attempts to create machine learning systems to assist clinicians and improve interpretation accuracy. An understanding of the capabilities and limitations of modern machine learning systems is necessary for clinicians as these tools begin to permeate practice. This systematic review aimed to provide an overview of machine learning applications designed to facilitate CXR interpretation. A systematic search strategy was executed to identify research, published between January 2020 and September 2022, into machine learning algorithms capable of detecting >2 radiographic findings on CXRs. Model details and study characteristics, including risk of bias and quality, were summarized. Initially, 2248 articles were retrieved, with 46 included in the final review. Published models demonstrated strong standalone performance and were typically as accurate as, or more accurate than, radiologists or non-radiologist clinicians. Multiple studies demonstrated an improvement in the clinical finding classification performance of clinicians when models acted as a diagnostic assistance device. Device performance was compared with that of clinicians in 30% of studies, while effects on clinical perception and diagnosis were evaluated in 19%. Only one study was prospectively run. On average, 128,662 images were used to train and validate models. Most classified fewer than eight clinical findings, while the three most comprehensive models classified 54, 72, and 124 findings. This review suggests that machine learning devices designed to facilitate CXR interpretation perform strongly, improve the detection performance of clinicians, and improve the efficiency of radiology workflow. Several limitations were identified, and clinician involvement and expertise will be key to driving the safe implementation of quality CXR machine learning systems.


Introduction
Chest X-rays (CXRs) have been used as the baseline chest imaging modality for more than a century [1]. This relatively simple method of image acquisition has provided access to radiological investigation of chest pathology to almost every corner of the globe, encompassing the investigation of infection, cardiac pathology, chest trauma, and malignancy.
The development of safe principles of ionizing radiation usage and advancements in the acquisition of digital images have led to reduced radiation exposure, improved image quality, and wider CXR availability. The CXR remains the most frequently performed medical imaging investigation worldwide [2].
There are, however, limitations to the diagnostic utility of the CXR. Soft tissue contrast assessment is limited by the projection of X-rays through multiple organs and the generation of a two-dimensional image with superimposed densities, which can lead to reduced sensitivity for subtle findings [3]. This makes CXR interpretation particularly challenging and, as a result, most cases of missed lung cancer appear to be due to errors in CXR interpretation [4]. Human error, reader inexperience, fatigue, and interruptions contribute to interpretation inaccuracy [3,5], and the availability of experienced thoracic radiologists is limited. Other imaging modalities are capable of providing high-sensitivity visualizations of the chest, including computed tomography (CT) and ultrasound. These modalities have been shown to have higher sensitivity for many findings, including pneumothorax [6], pneumonia [7], and lung nodules [8]. However, due to widespread availability, short scan time, low cost, and low radiation exposure, the CXR remains the first-line imaging modality for chest assessment [9]. For these reasons, there have been many attempts to create artificial intelligence (AI) systems to assist radiologists in the task of CXR interpretation [10,11].
Machine learning, a subdomain of AI that involves learning patterns in data to enable effective prediction and classification, is profoundly influencing care delivery across medical specialties from pathology to radiology [12][13][14][15][16]. Deep learning image processing algorithms are based on convolutional neural networks (CNNs) and have been trained to detect pneumothorax [17], pneumonia [18], COVID-19 [19][20][21][22][23][24], pneumoconiosis [25], tuberculosis [26], and lung cancer [27]. Models have been developed to automate lung segmentation and bone exclusion [28], identify the position of feeding tubes [29], and predict temporal changes in imaging findings [30]. While these studies have not assessed the usefulness of AI models across many findings simultaneously, they have shown that deep learning diagnostic tools can improve the classification performance of radiologists in the detection of pulmonary nodules [31], pneumoconiosis [25], pneumonia [18], emphysema [10], and pleural effusion [32]. Coupling AI models with clinicians can result in higher diagnostic accuracy than either AI or clinicians alone [33]. In addition, such tools appear to improve reporting efficiency by reducing interpretation time [18].
Most deep learning systems developed to date, however, have been limited in scope, often to a single or a few findings [10,34]. While demonstrating high performance within their narrow application domains, their lack of clinical breadth may limit their utility in practice. Concerns have also been raised regarding potential risks and biases that may accompany the use of deep learning systems for image interpretation assistance, such as poor generalizability across populations [35] and automation bias [36].
The application of machine learning on chest X-rays to assist in the diagnosis of COVID-19 was a real-world example that highlighted both the benefits and pitfalls of medical imaging AI. Multiple algorithms have been developed for this purpose in recent years and have demonstrated high levels of accuracy in standalone tests [19][20][21][22][23][24]. However, the performance of some COVID-19 machine learning models has been shown to suffer when applied to datasets more representative of real-world cohorts [37], attributed, in part at least, to the issue of hidden stratification and confounded training data. A broad understanding of the modern capabilities of AI systems applied to CXR interpretation, as well as potential limitations, will be necessary for clinicians as these and similar tools are introduced into their workflow in the coming years.
To that end, this literature review aimed to provide a contemporary and comprehensive overview of deep learning applications designed to facilitate CXR interpretation. Specifically, we sought to identify algorithm performance and scope, risks and benefits, and opportunities for future research and model development. Section 2 of this paper includes a description of our applied methods; Section 3, the results of our systematic review; and Section 4, a discussion of the implications of recent developments in this subdomain of applied machine learning in medicine.

Methods and Materials
The methods applied in this systematic review were guided by the standards of the Institute of Medicine [38] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [39]. The prospective protocol was developed and approved by senior study authors. Risk of bias (ROB) within selected studies was assessed using the prediction model risk of bias assessment tool (PROBAST) [40].

Search Strategy
A comprehensive search strategy was developed and applied to the PubMed and ScienceDirect databases. To collate a contemporary sample of the literature within the rapidly developing field of deep learning technology, studies published between January 2020 and September 2022 were identified. The search strategy was based on combinations of domain-specific and methodological search terms, comprising both keywords and Medical Subject Headings (MeSH) terms (Table 1).

Eligibility Criteria
Publications were selected for full text review if they satisfied the following inclusion criteria: original research published in a peer-reviewed journal; published in English; involved the application of machine learning techniques to facilitate CXR interpretation and diagnosis; involved the use of CXR image data; addressed multiple radiological findings relevant to CXRs (>2 pathologies); and included data from adult patients. Studies were included if they evaluated model performance with one or more of the following performance metrics: accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, and Matthews correlation coefficient (MCC). Articles were excluded if they were review articles, books, book chapters, or conference abstracts; did not involve the deployment of deep learning; did not involve the processing of CXR data; focused on a nonclinical application; or focused on data from a pediatric population.
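As an illustration of how the threshold-based metrics listed above relate to one another, the following minimal sketch derives them from the four cells of a binary confusion matrix for a single finding. The function name and the example counts are hypothetical and are not drawn from any included study; AUC is omitted because it requires continuous model scores rather than counts.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives for one finding.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "ppv": _ratio(tp, tp + fp),           # positive predictive value
        "npv": _ratio(tn, tn + fn),           # negative predictive value
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for one finding in a test set of 1000 images.
metrics = classification_metrics(tp=80, fp=30, tn=870, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is high (0.95) while PPV is noticeably lower (about 0.73), which is one reason the reviewed studies report several metrics rather than accuracy alone.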

Study Selection Process
Database searches were completed by one author, with all references imported and consolidated into a web-based bibliographic software package (Paperpile LLC, MA, USA). Citations and study details, including abstracts, were exported to a custom Excel spreadsheet for data management. Keywords (e.g., "letter", "proceedings", "review", "computed tomography" (also "CT"), "pediatric") were used to identify articles for exclusion. Duplicates were removed. Multiple authors (M.R.M, Q.D.B, H.K.A, N.E, G.S, H.C) conducted manual screening to exclude titles and abstracts that did not meet predefined eligibility criteria. Two review authors (M.R.M, J.C) repeated this manual screening review on a 20% sample of the identified studies as a quality check. There was no disagreement between the main review process and the quality check. Studies passing the title and abstract screen underwent full text review and were appraised for inclusion. Disagreement or uncertainty regarding the inclusion of an article was resolved via discussion within the review team.
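The keyword flagging and duplicate removal steps described above could be sketched as follows. This is an illustrative reconstruction only, not the spreadsheet workflow the authors used: the record structure, the function names, and the matching rules (whole-word matching, case-sensitive matching for the abbreviation "CT") are assumptions.

```python
import re

# Example exclusion keywords quoted in the text above.
EXCLUSION_KEYWORDS = [
    "letter", "proceedings", "review", "computed tomography", "CT", "pediatric",
]


def normalize_title(title):
    """Lowercase a title and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first record for each normalized title."""
    seen, unique = set(), []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def exclusion_hits(record, keywords=EXCLUSION_KEYWORDS):
    """Return the keywords found as whole words in the title or abstract."""
    text = f"{record['title']} {record.get('abstract', '')}"
    hits = []
    for keyword in keywords:
        # Assumption: abbreviations such as "CT" are matched case-sensitively
        # so that ordinary words containing "ct" are not flagged.
        flags = 0 if keyword.isupper() else re.IGNORECASE
        if re.search(rf"\b{re.escape(keyword)}\b", text, flags):
            hits.append(keyword)
    return hits


# Hypothetical records for illustration.
records = [
    {"title": "Deep learning for chest radiograph triage",
     "abstract": "We evaluate a CNN on adult CXRs."},
    {"title": "Deep Learning for Chest Radiograph Triage.", "abstract": ""},
    {"title": "A review of pediatric imaging", "abstract": "Narrative overview."},
]
unique_records = deduplicate(records)
flagged = {r["title"]: exclusion_hits(r) for r in unique_records}
print(flagged)
```

Flagged records would still need the manual title and abstract screening described above, since a keyword such as "review" can appear in the abstract of an eligible primary study.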

Data Extraction and Appraisal
For each included study, specific items for data extraction were collected and coded. These included study identifiers, study characteristics such as purpose, design, type, and setting, study outcomes and performance measurement, methods for model development and validation, machine learning algorithm characteristics, study results and findings, whether the article was a duplicate, whether the article was included or excluded, and the reason for exclusion. Included studies underwent an assessment of design and methodological quality using criteria defined in Table 3 and the PROBAST [40] tool.

Synthesis and Assessment
A PRISMA flow diagram [41] was produced to illustrate study screening, selection, and inclusion. Study and model details, including the number of clinical findings classified, design, dataset size, datasets used, validation techniques applied, performance metrics, and key findings were tabulated to facilitate analysis and benchmarking. Outcomes and key themes were summarized using descriptive statistics.

Included Articles
The search resulted in the retrieval of 2248 records (Figure 1). We assessed 90 full text articles and included 46 in the final quantitative and qualitative analysis. Model and study details, along with ROB and quality, were summarized for each study. Diagnostics 2023, 13, 743

Summary of Included Articles
The literature review identified 46 primary studies that met inclusion criteria. Most studies employed a retrospective data analysis approach to investigate device performance (97%). Only one was conducted as a prospective study in a real-world environment (Jones et al., 2021 [42]). Device performance was compared with that of physicians in 14 of the 46 included studies (30%). Device augmentation effects on clinical perception and diagnosis were evaluated in 9 of these 14 studies (19% of all included studies). A summary of included studies, their aims, design, datasets, and number of findings identified is outlined in Table 2.


Quality Appraisal and Risk of Bias
Included studies underwent a quality appraisal. Results of the assessment of study quality, including appraisal criteria and scores for article quality, are presented in Table 3. The quality of studies varied across assessment domains, with some studies demonstrating a marked lack of methodological quality. A total of 29 studies were considered high quality, with an overall quality score of 70, while 15 studies were considered moderate quality, with a score of 50-60. Two studies were low quality, with an overall score of 30-40. The most common factor adversely affecting study quality was the lack of an appropriate comparator for device performance. All studies demonstrated an appropriate design, whereas only 67% involved the use of appropriate comparators. Some studies involved training a single model and did not compare its performance to other baseline models or to clinicians. Model training datasets were of sufficient size and quality in 78% of studies. Likewise, appropriate validation methods were applied in 78% of studies. Often, training dataset characteristics and validation methods were not reported; however, this was not considered a negative indicator of study quality because several studies investigated commercial or previously established devices, and these details were reported in previous studies. Appropriate sample size, performance metrics, and statistical analysis techniques were prevalent, evident in 97%, 100%, and 93% of studies, respectively.
The PROBAST [40] ROB tool assessed shortcomings in study design, conduct, and analysis that may have put the results of a study at risk of being flawed or biased. Of the 46 studies assessed, 39 were determined to be at low risk of bias, 6 at high risk, and 1 study was of unclear risk [89] (Table A1 in Appendix A). Assessments across the four PROBAST domains, presented as percentages, are displayed in Figure 2. The primary contributor to the high ROB in these studies was associated with patient selection methods.

Comprehensiveness and Algorithm Development
A clear theme that emerged from this systematic review was that machine learning models designed to facilitate CXR interpretation have become substantially more clinically comprehensive. Many models were only capable of classifying fewer than eight clinical findings. The mean number of clinical findings classified by models was 16, with a mode of 14. The frequency of models classifying 14 findings reflects the recurring use of the Chest X-ray14 [47] dataset, which is labeled for 14 diseases. The top three most comprehensive CXR classification models, however, markedly exceeded these benchmarks.

The three most comprehensive models classified 54, 72 [11], and 124 [42] findings, and included the model of Jadhav et al., 2020 [64]. A breakdown of findings evaluated per device is presented in Figure 3. Algorithm architectures applied included UNet [96], DenseNet [97], ResNet [98], EfficientNet [99], and the VGG neural networks [100].

Data, Model Training, and Ground Truth Labeling
The development of effective comprehensive CXR machine learning models relies on access to large datasets. Of the studies that reported their training and validation dataset size, on average, 128,662 images were used to train and validate models. The most comprehensive CXR models encompassing more than 10 clinical findings have been based on just four public datasets: MIMIC [65], PadChest [75], Chest X-ray14 [47], and CheXpert [52]. Figure 4 illustrates the commonly used datasets in the studies identified.

Model validation methods varied but generally adhered to the standard three-way dataset split paradigm (train-validation-test). Limited studies conducted external validation on a dataset from a different setting than the training dataset. No models were validated in a randomized controlled trial. Ground truth processes employed by researchers varied. Most studies employed a consensus of radiologists (usually two to five) who often had access to CXR reports and, in some cases, were able to correlate CXRs with CTs. A triple consensus of general (rather than subspecialist) radiologists was the most common ground truth labeling approach.
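The three-way split and the consensus labeling approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the procedure of any included study: splitting at the patient level (so that images from one patient never appear in more than one partition), the 80/10/10 proportions, and simple majority voting among an odd number of readers are all choices made here for the example.

```python
import random
from collections import Counter


def three_way_split(patient_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Split unique patient IDs into disjoint train, validation, and test sets."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test


def consensus_label(reader_labels):
    """Majority vote over binary reader labels; requires an odd number of readers."""
    if len(reader_labels) % 2 == 0:
        raise ValueError("Use an odd number of readers to avoid ties.")
    return Counter(reader_labels).most_common(1)[0][0]


# Hypothetical cohort of 100 patients.
train_ids, val_ids, test_ids = three_way_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))

# Hypothetical triple read for one finding on one image (1 = present).
print(consensus_label([1, 0, 1]))
```

A held-out test partition of this kind is still internal validation; external validation, as noted above, requires data from a different setting than the training dataset.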

Performance and Safety
Identified studies used several different indicators to assess device performance, with the most common of these being the measurement of finding detection accuracy (Table 4). Comparators included other CXR models and clinician readers.

Discussion
Machine learning applied to the analysis and interpretation of CXRs carries with it significant potential for clinical quality and safety improvement. The field is developing quickly. This study was designed to comprehensively assess the performance and scope of modern algorithms and their associated risks, benefits, and development opportunities. The 46 studies included in this systematic review offer an insight into emerging themes within the contemporary landscape of deep learning models designed to interpret CXRs. There are clear trends towards increasing device comprehensiveness and improving model performance.
Published models generally demonstrated strong performance for detecting a range of clinical findings on the CXR. Some demonstrated moderate performance and likely require further development before attempts are made to apply them to clinical practice. In contrast, one comprehensive model demonstrated standout performance, with an average AUC of 0.96 across 124 findings [83]. The next most comprehensive model, which was capable of detecting 72 findings, demonstrated an average AUC of 0.77 [11]. When compared with physician detection accuracy, the identified devices were typically found to be as accurate as, or more accurate than, radiologists or non-radiologist clinicians [11,43,59,63,71,74,81,82,83,88]. Taking this further, multiple studies demonstrated that use of well-trained and validated deep learning models can improve the clinical finding classification performance of clinicians when acting as a diagnostic assistance device [42,43,57,62,66,74,83,87]. This points to the potential utility and impact of machine learning systems applied to clinical practice.
Transfer learning and open access to pretrained models and model architectures have underpinned the development of effective deep learning models in radiology. The continued development and optimization of these kinds of transferable models would be beneficial for facilitating further improvements in healthcare.
Another endpoint assessed by several studies was reporting and interpretation efficiency. Some included studies evaluated the performance of high-accuracy devices within the scope of developing triage or prioritization tools, which are designed to alert clinicians to cases suspected of containing time-sensitive findings. These devices have the potential to improve efficiency and patient safety by reducing the time between image acquisition and reporting by the physician. Simulation studies indicate that when these devices are used to triage studies, the report turnaround time (RTAT) of cases that include time-sensitive findings is significantly reduced [49,74]. In addition to RTAT, reporting time is another indicator used to measure efficiency. Several studies investigated the impact of AI-assisted interpretation on reading time, with some studies indicating that reporting time was reduced [43,74,87], while others found that reporting time was increased [42]. A demonstrable impact to patient outcomes may follow AI-enabled efficiency gains to radiology workflows; however, further research is necessary to establish the presence or extent of such benefits.
While the majority of studies were conducted on retrospective datasets, one study was conducted in a prospective real-world reporting environment and evaluated radiologist agreement and impact on clinical decision making due to device findings [42]. Results indicated that the radiologist and device were in complete agreement in 86.5% of cases, and device predictions led to significant report changes, changed patient management planning, and altered further imaging recommendations in 3.1%, 1.4%, and 1.0% of cases, respectively. A similar retrospective study was conducted, producing comparable results [68]. In another study, a device was used to flag cases suspected of containing clinically significant findings that were initially labeled normal [62]. The device detected relevant abnormalities that had initially been overlooked, with a detection yield and a false referral rate of 2.4% and 14.0%, respectively.

Risk and Safety
Several recurring risks were highlighted by researchers, including the potential for poor model generalizability, suboptimal case labeling, and the potential for data perturbation. The overfitting of CXR models has also been identified as a performance risk, leading to overestimation of performance or poor generalizability of machine learning models on external datasets [9]. External validation is an important issue in applied machine learning that has potential implications for patient safety. Some evidence suggested that high performing models may not generalize well [69]. In this review, only a limited number of included studies performed external validation of the evaluated device. Some studies reported significant drops in model performance when they were applied to external data [73,85]. These studies that reveal the so-called 'generalization gap' underscore the need for vigilance by healthcare providers whenever efforts are made to translate machine learning models into clinical practice.
Limitations in availability of large, high quality, and accurately labeled CXR datasets can present a potential risk for developing and testing high performing and appropriately generalizable machine learning models [9]. More than half of included studies used training data from publicly available datasets originating solely from US patients (Chest X-ray14 [47], CheXpert [52], MIMIC [65]), while many others used curated private datasets with images from institutions limited to a single country or region [42,55,56,59,62,66,68,74,76,77,81,87,88]. A limited number of studies leveraged data from multiple countries [44,46,82,83,84,95]. Additional generalizability studies are required to test and verify the performance of deep learning models across different patient populations. The ethical public release of large de-identified datasets may facilitate the development of higher quality and more generalizable machine learning systems.
Natural language processing (NLP) can be problematic and noisy when used for the generation of training or ground truth labels [53]. At present, several common public datasets use NLP on the original radiologist reports to identify pathology contained in CXR images [47]. Reports are often incomplete representations of clinical findings present in the associated imaging. NLP is, therefore, prone to inaccurate image annotation, leading to negative downstream effects on model training. For example, it has been reported that the NLP-generated labels in the Chest X-ray14 [47] dataset, which was used in 17 studies, do not accurately reflect the visual content of the CXR images [101]. Investments in high-quality data labeling by expert clinicians may serve to address this issue, but these activities are resource intensive.
Testing datasets should ideally be representative of the target population (e.g., include diverse demographic groups) and the target disease, condition, or abnormality for which the model is intended. The use of datasets that include limited patient subgroups or are enriched for particular findings may not reflect the true prevalence of a disease or condition in the real world, potentially leading to spectrum bias. Spectrum bias present in the dataset can lead to model generalizability issues, resulting in reduced performance and limited clinical applicability. Several studies were identified that may have been affected. Examples include datasets that contained only one or two findings per image [57,87], only included CXRs with an associated follow-up CT scan [66], and datasets that were hand-picked rather than consecutively selected [81,82]. Further work testing and demonstrating the generalizability characteristics of published models is warranted and will serve to reinforce user confidence and patient safety.
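One consequence of testing on datasets enriched for particular findings can be shown with a short worked calculation. The sketch below applies Bayes' theorem to show how PPV changes with prevalence when sensitivity and specificity are held fixed; the sensitivity, specificity, and prevalence values are hypothetical. It covers only the prevalence effect: spectrum bias can additionally alter sensitivity and specificity themselves, which this simple calculation does not model.

```python
def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical model with 90% sensitivity and 90% specificity.
enriched_ppv = ppv_at_prevalence(0.9, 0.9, prevalence=0.50)   # enriched test set
realistic_ppv = ppv_at_prevalence(0.9, 0.9, prevalence=0.02)  # low-prevalence cohort
print(f"PPV at 50% prevalence: {enriched_ppv:.2f}")
print(f"PPV at 2% prevalence:  {realistic_ppv:.2f}")
```

In this example the same model yields a PPV of 0.90 on the enriched set but roughly 0.16 at 2% prevalence, so most positive predictions in the low-prevalence cohort would be false positives.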
Another consideration is the potential negative influence of AI systems on physician decision making. Automation bias, where overreliance on automated systems may lead to false positives being overlooked or a reluctance to question the suggestions made by the AI model, appears to be a particular risk for less experienced clinicians. While these issues were not assessed empirically in the included studies, the issue was highlighted and discussed. Models with a high false positive rate may require greater clinical expertise to separate true from false positives [61,62,67]. Conversely, evidence also suggests that less experienced clinicians may see the greatest benefit from AI diagnostic assistance [42]. To mitigate the risk of automation bias, manufacturers are expected to clearly report the performance details of their AI assist devices, and clinicians are expected to understand the performance characteristics and limitations of the systems they use. When developing algorithms for real-world use, vendors should be aware of evolving evidence pertaining to the mitigation of automation bias, including implementation principles and interface design choices [102].
The quality of the dataset labeling method is likely to be a cornerstone of safe deep learning model development for systems intended for clinical use. Open source datasets may be vulnerable to adversarial perturbation, which can induce model failure or falsely high performance in image classification tasks [103]. Image perturbations are often difficult to detect. They can be extremely small (a few pixels) and hence may not substantially affect data distributions. Attention to data security controls is necessary for systems intended for clinical application.
AI-assisted triage may lead to longer RTAT in the case of false negatives through down-prioritization of these cases and up-prioritization of cases with positive AI predictions. One study highlighted that there was a risk of false negatives leading to greatly increased RTAT for these studies, which would equate to a significant delay in patient treatment in the real world [49]. The performance characteristics and limitations of clinically applied models must be rigorously evaluated and clearly understood.
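The effect of triage on RTAT for true positives and false negatives can be illustrated with a toy worklist. This is a deliberately simplified sketch and not the simulation design of the cited studies: it assumes all studies are already waiting in a backlog, a single reader, and a fixed reporting time per study, and the worklist itself is hypothetical.

```python
def report_times(studies, minutes_per_report=5, triage=False):
    """Return minutes until each study is reported, keyed by study ID.

    Without triage, studies are read in arrival order. With triage,
    AI-flagged studies are read first, preserving arrival order within
    the flagged and unflagged groups.
    """
    order = list(range(len(studies)))
    if triage:
        # Stable sort: flagged studies (key False) come before unflagged ones.
        order.sort(key=lambda i: not studies[i]["ai_flag"])
    return {
        studies[i]["id"]: position * minutes_per_report
        for position, i in enumerate(order, start=1)
    }


# Hypothetical backlog in arrival order.
worklist = [
    {"id": "A", "critical": False, "ai_flag": False},
    {"id": "B", "critical": True, "ai_flag": False},   # false negative
    {"id": "C", "critical": False, "ai_flag": False},
    {"id": "D", "critical": True, "ai_flag": True},    # true positive
    {"id": "E", "critical": False, "ai_flag": True},   # false positive
]
fifo = report_times(worklist)
triaged = report_times(worklist, triage=True)
for study in worklist:
    sid = study["id"]
    print(sid, fifo[sid], "->", triaged[sid])
```

In this toy example the correctly flagged critical study (D) is reported sooner under triage, while the critical study missed by the model (B) is reported later than it would have been in arrival order.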

Benefits
The clinical benefits of AI models for medical image interpretation can be divided into two primary domains: improved accuracy in detecting pathology on the image, and improved reporting efficiency. Improved reporting accuracy was highlighted in numerous included studies [42,43,57,62,66,74,83,87]. This has the potential to reduce false positive and false negative rates and reduce unnecessary follow-up CT examinations and associated radiation exposure. This may lead to earlier finding detection and improved patient outcomes in screening, outpatient, emergency, and inpatient settings. While the majority of studies appeared to demonstrate improved physician performance with diagnostic device assistance, the device evaluated by Hwang and colleagues focused on detecting false negatives in CXRs originally interpreted as normal by radiologists [62]. CXRs with "normal" reports were assessed by the AI model. Researchers demonstrated a false referral rate of 0.97% and found that 1.2% contained salient clinical findings. Employing machine learning models to reduce false negative rates and improve the quality of reporting in this way will continue to be of interest to radiology providers as workload volume and complexity grow.
Several included studies demonstrated improved reporting efficiency, which coalesced into two primary categories. These were (1) reduced time to report studies that contain critical pathology [49,74] and (2) a reduction in reporting time per case [43,74,87]. An increase in reporting efficiency may impact patient outcomes by reducing the time to treatment for patients presenting with time-sensitive pathologies and increasing the rate at which physicians can report CXRs.
A further benefit identified was the ability of some AI models to provide consistent detection accuracy across variations in image quality. Some studies demonstrated that model performance was resilient to different image sources and suboptimal acquisition quality [57,60,82,88]. Demonstrating this kind of model resilience provides additional quality and safety assurance to the practicing clinician.
In addition to the benefits outlined above, a study conducted by Jabbour and colleagues highlighted the value of using an AI model capable of combining and evaluating patient information from multiple sources to further improve diagnostic accuracy [63]. In this study, a model designed to differentiate between causes of acute respiratory failure was trained using CXRs and clinical data from electronic health records, leading to a detection accuracy similar to, or better than, clinician readers. The application of multimodal AI systems is a developing trend in medicine [104]. A summary of the clinical benefits identified in the included studies is presented in Table 5.

Study Strengths and Limitations
The strengths of this systematic review include adherence to the PRISMA guidelines and the standards of the Institute of Medicine, and a critical assessment of risk of bias for the included studies using a robust assessment tool, PROBAST. Further strengths were the comprehensive search strategy and the duplicate screening of a portion of identified studies by multiple authors as a quality control process. Limitations of this review include the use of a detailed although unvalidated tool for the assessment of study quality and the restriction of screened studies to the English language. However, recent evidence suggests that restricting a search strategy to the English language is unlikely to affect results [105].

Conclusions
Deep learning has been widely applied to successfully facilitate CXR interpretation. Models have been developed to classify a wide range of pathologies, and it is evident that models are becoming progressively more clinically comprehensive. It is also apparent that classification performance is improving over time.
This review focused on machine learning devices for classification of CXRs, revealing that many such software devices have been developed since January 2020. The benefits of the devices described fall under several categories, including improved pathology detection accuracy, improved triage to reduce time to treatment for critical findings, and a reduction in reporting time.
While the benefits of these devices were well reported, the potential risks associated with their adoption remained poorly characterized, with risks only superficially noted in some primary studies and not examined explicitly. The key risks associated with these devices include the potential for dataset spectrum bias, resulting from datasets not being reflective of the real-world environment, potentially limiting their clinical application. Additionally, external validation to test model generalizability was often not reported. Another risk, particularly for less experienced clinicians, is automation bias.
There is currently a global shortage of radiologists, accompanied by increased rates of clinician burnout [106]. In the United States, the number of radiologists as a percentage of the physician workforce is decreasing, and the geographic distribution of radiologists favors larger, more urban settings [107]. Even when trained radiologists are available, CXRs are often interpreted first and acted upon by non-radiologist clinicians such as intensivists and emergency physicians [108]. In developing countries, radiology services are scarce. As of 2015, only 11 radiologists served the 12 million people of Rwanda, while the entire country of Liberia, with a population of four million, had only two practicing radiologists [108]. In our experience, in some health systems, as few as one in ten CXRs are ever reviewed and reported by a radiologist. The accurate automated analysis of radiographs has the potential to improve radiologist workflow efficiency and extend life-changing clinical expertise to underserved regions [49]. In developing countries, solving the cost, complexity, skill requirement, and sustainability issues of radiology services has been a long-standing challenge [109,110]. Deep learning diagnostic adjuncts have the potential to increase radiology capacity and provide patients with better access to these services.
The quality of clinical machine learning decision support systems is dependent upon the quality of the full product development lifecycle, from initial design to post-implementation monitoring [111,112]. Careful data curation and processing are required to ensure that data are broadly representative of clinical populations, to manage label fidelity, and to ensure quality model training and validation [71]. Robust clinical evidence is required to demonstrate reliability, validity, safety, and beneficial clinical impact. Usability and interpretability for clinical end users are critical to adoption, and effective post-implementation performance and safety monitoring is key to quality management and ensuring patient care improvement [113].
The immediate future of applied machine learning in CXRs seems likely to follow the trends established in this systematic review. Models will become more clinically comprehensive, and their performance will continue to improve, approaching and exceeding that of human expert counterparts. In pursuit of these aspirations, we may see increasing use of novel development techniques such as generative adversarial networks (GANs) to augment training datasets and overcome the challenge of data limitations [114]. CXR data may be drawn upon by multimodal deep learning models and combined with other modalities such as ECGs to better predict specific disease states [115,116]. Early work has even shown that two-dimensional CXRs can be used to reconstruct three-dimensional CT images and improve pathology detection and classification efforts [117]. Interpretation automation may benefit patients in communities that lack radiologist expertise and where investigations presently go unreported.
Machine learning is driving the future of radiology. Developments will require shifts in clinical practice and careful risk mitigation. Radiologists need to be a part of the machine learning development process and drive the safe implementation of high-quality systems. Radiologists will play a key role in quality control and innovation as machine learning systems are applied to achieve better patient outcomes at scale.
Funding: This research and the APC were funded by Annalise.ai.

Data Availability Statement:
The datasets used and/or analyzed during the current study are available from the corresponding author on request.

Conflicts of Interest:
Employees of the funder (Annalise.ai) were involved in study design, data collection, data analysis, data interpretation, and writing of the report.